In this Exploratory Data Analysis (EDA) I explore a dataset about the quality of red wine. The dataset contains 13 variables and 1599 observations. It includes information about the quality level, several acidity measures, residual sugar, alcohol, density and pH level.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The two outputs above show the names of the 13 variables as well as the structure of our dataset.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
This output shows the first 6 rows (the default) of the table, so I can get familiar with the values and columns.
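The three outputs above have the layout that base R's `names()`, `str()` and `head()` produce. The following is a minimal sketch of those calls, assuming the data sits in a data frame. The name `wine` is hypothetical, and only the first ten rows of four columns are copied from the `str()` output so that the block runs without the original file.

```r
# Small stand-in for the full dataset: first ten rows of four columns,
# copied from the str() output above. `wine` is a hypothetical name.
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine)  # column names
str(wine)    # type and first values of each column
head(wine)   # first 6 rows by default; head(wine, 10) would show 10
```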
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The output above is the statistical summary of the dataset and gives a better idea of the values each variable can take. The quality ranges from 3 to 8 with a mean of 5.6. The mean alcohol level is 10.42, the mean pH level is 3.31 and the mean volatile acidity is 0.53.
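A summary table of this shape is what `summary()` returns for a data frame. The sketch below applies it to a ten-row sample copied from the `str()` output (the name `wine` is hypothetical), so its numbers differ from the full-dataset values quoted above.

```r
# Ten-row sample copied from the str() output; not the full 1599 rows.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

summary(wine)                         # min, quartiles, median, mean, max per column

quality_range <- range(wine$quality)  # smallest and largest quality score
quality_mean  <- mean(wine$quality)
alcohol_mean  <- mean(wine$alcohol)
```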
Both plots look slightly right-skewed, so it is a good idea to transform the data with log10.
After the log10 transformation, both distributions look approximately normal.
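To make the effect of a log10 transformation concrete, the sketch below compares the skewness of a small sample before and after the transformation. It uses the first ten residual.sugar values from the `str()` output. The `skewness()` helper is a hypothetical illustration (the moment-based coefficient), not a function from the original analysis.

```r
# First ten residual.sugar values from the str() output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: moment-based skewness (positive = right-skewed).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(residual.sugar)         # strongly positive
skew_log <- skewness(log10(residual.sugar))  # closer to zero

# Binned counts on both scales, without drawing a plot.
hist_raw <- hist(residual.sugar, plot = FALSE)
hist_log <- hist(log10(residual.sugar), plot = FALSE)

# log10 needs strictly positive input: a zero maps to -Inf.
log10(0)
```

A log10 transformation only reduces right skew; it does not guarantee a normal shape. Variables that contain zeros (the summary shows a minimum of 0 for citric.acid) need extra care, because `log10(0)` is `-Inf`.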
The plots for citric.acid and residual.sugar are also strongly right-skewed, so I transform them with log10 as above.
These are plots p3 and p4 after the transformation, now with an approximately normal shape.
As in plots p1 to p4, these two plots are right-skewed too and have to be transformed with log10.
Now both sulfur dioxide variables look approximately normally distributed.
The plots for density and pH level both look normally distributed, so we don’t have to transform them.
These are the plots for chlorides and sulphates, which are also right-skewed.
After the transformation, the plots for chlorides and sulphates look approximately normally distributed.
The plot for alcohol looks right-skewed, so I will transform it with log10.
A new variable for the rating of the wine was created from the variable ‘quality’.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality wine_rating
## 1 5 medium
## 2 5 medium
## 3 5 medium
## 4 6 medium
## 5 5 medium
## 6 5 medium
Here you can see that the new variable ‘wine_rating’ was created and added to the dataset.
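One common way to derive such a categorical rating from a numeric score is `cut()`. The sketch below is an assumption, not the original code. The cut points are chosen so that quality 5 and 6 map to "medium", as in the output above. The limits for "low" (4 or less) and "high" (7 or more) are hypothetical.

```r
# First ten quality scores from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: (0,4] = low, (4,6] = medium, (6,10] = high.
# Only the "medium" bin for 5 and 6 is confirmed by the output above.
wine_rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "medium", "high")
)

table(wine_rating)  # count of wines per rating, including empty levels
```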
We can now observe that most wines have an alcohol level at the lower end of the range (the median is 10.2) and that the quality ranges from 3 to 8, with peaks at 5 and 6.
Here we see the plot for the new variable ‘wine_rating’, which clearly shows that most red wines in this dataset were rated medium.
## [1] 0
There are no missing values in the dataset.
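A single count like the `0` above is what `sum(is.na(...))` returns for a data frame without missing values. This is a minimal sketch on a three-row sample. The names `wine` and `wine_gap` are hypothetical, and the gap in the second data frame is introduced on purpose for illustration.

```r
# Three rows of two columns copied from the head() output above.
wine <- data.frame(
  pH        = c(3.51, 3.20, 3.26),
  sulphates = c(0.56, 0.68, 0.65)
)

n_missing <- sum(is.na(wine))  # total number of NA cells in the data frame
colSums(is.na(wine))           # number of NA cells per column

# Illustration only: introduce one artificial gap and count again.
wine_gap <- wine
wine_gap$pH[2] <- NA
n_missing_gap <- sum(is.na(wine_gap))
```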
## [1] 143 145 468 589 653 822 1115 1133 1229 1270 1271 1476 1478
The boxplot shows some outliers in the variable ‘alcohol’.
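Boxplot outliers can also be listed without drawing the plot. The sketch below assumes the usual 1.5 × IQR rule, which is the default in `boxplot.stats()`, and applies it to the first ten alcohol values from the `str()` output. Whether the row numbers printed above were produced this way is not stated in the report.

```r
# First ten alcohol values from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges (coef = 1.5).
stats <- boxplot.stats(alcohol)

outlier_values <- stats$out                         # the flagged values
outlier_rows   <- which(alcohol %in% outlier_values) # their positions in the vector
```

In this small sample only the value 10.5 lies beyond the upper whisker. On the full dataset the flagged rows will differ.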
The RedWineQuality dataset has 13 variables and 1599 observations. The variables describe chemical properties of the wine such as volatile and fixed acidity, citric acid, chlorides, sulfur dioxide, sulphates, alcohol and residual sugar. There are also measurements such as density, pH level and the quality of the wine. There are no missing values in the dataset.
The most interesting aspects of this dataset are the quality of wine in relation to the level of alcohol, the pH level in relation to volatile acidity and residual sugar, the sulfur dioxide and possibly the density.
To get a clearer view of the distribution of wine quality, I created a new variable called ‘wine_rating’ from ‘quality’. This variable groups the red wines into low, medium and high quality. The plot clearly shows that most red wines were rated medium.
In my first outputs I inspected the variables and their values. I then plotted each variable to see its distribution more clearly. Most of the variables were right-skewed, so I transformed these plots with log10 to get an approximately normal distribution. I also found some outliers in volatile.acidity, citric.acid, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, pH and alcohol.
## volatile.acidity citric.acid residual.sugar free.sulfur.dioxide density
## 1 0.70 0.00 1.9 11 0.9978
## 2 0.88 0.00 2.6 25 0.9968
## 3 0.76 0.04 2.3 15 0.9970
## 4 0.28 0.56 1.9 17 0.9980
## 5 0.70 0.00 1.9 11 0.9978
## 6 0.66 0.00 1.8 13 0.9978
## pH alcohol quality
## 1 3.51 9.4 5
## 2 3.20 9.8 5
## 3 3.26 9.8 5
## 4 3.16 9.8 6
## 5 3.51 9.4 5
## 6 3.51 9.4 5
## volatile.acidity citric.acid residual.sugar
## volatile.acidity 1.00 -0.55 0.00
## citric.acid -0.55 1.00 0.14
## residual.sugar 0.00 0.14 1.00
## free.sulfur.dioxide -0.01 -0.06 0.19
## density 0.02 0.36 0.36
## pH 0.23 -0.54 -0.09
## free.sulfur.dioxide density pH alcohol quality
## volatile.acidity -0.01 0.02 0.23 -0.20 -0.39
## citric.acid -0.06 0.36 -0.54 0.11 0.23
## residual.sugar 0.19 0.36 -0.09 0.04 0.01
## free.sulfur.dioxide 1.00 -0.02 0.07 -0.07 -0.05
## density -0.02 1.00 -0.34 -0.50 -0.17
## pH 0.07 -0.34 1.00 0.21 -0.06
Here we have our correlation matrix, which shows where the strongest correlations between variables are.
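A rounded matrix of this kind can be computed with `cor()`. The sketch below uses a ten-row sample of four columns copied from the `str()` output. The name `wine_sub` is hypothetical, and the coefficients of such a small sample differ from the full-dataset values shown above.

```r
# Ten-row sample copied from the str() output; not the full dataset.
wine_sub <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation (the default of cor()), rounded to two decimals.
cor_m <- round(cor(wine_sub), 2)
cor_m
```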
## Var1 Var2 value
## 1 volatile.acidity volatile.acidity 1.00
## 2 citric.acid volatile.acidity -0.55
## 3 residual.sugar volatile.acidity 0.00
## 4 free.sulfur.dioxide volatile.acidity -0.01
## 5 density volatile.acidity 0.02
## 6 pH volatile.acidity 0.23
Heatmap of the strongest correlations in the dataset.
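The `Var1`/`Var2`/`value` table above is the long format of the correlation matrix, with one row per pair of variables, which is the shape a tile-based heatmap typically expects. The sketch below reproduces that layout in base R for the first three variables, using the coefficients printed in the matrix above. Which function produced the original table is an assumption left open here.

```r
vars <- c("volatile.acidity", "citric.acid", "residual.sugar")

# Upper-left 3 x 3 corner of the correlation matrix printed above.
cor_m <- matrix(
  c( 1.00, -0.55, 0.00,
    -0.55,  1.00, 0.14,
     0.00,  0.14, 1.00),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# One row per cell: Var1 = row variable, Var2 = column variable.
cor_long <- as.data.frame(
  as.table(cor_m),
  responseName = "value",
  stringsAsFactors = FALSE
)
head(cor_long)
```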
Here we have the boxplots for quality vs. alcohol and quality vs. residual.sugar.
Here we have the scatterplots for the relationship of the variables in the boxplot above.
Here we have the boxplots for pH vs. volatile.acidity and residual.sugar vs. pH.
Here we have the boxplots for residual.sugar vs. volatile.acidity and alcohol vs. pH.
These are the scatterplots showing the relationship between residual.sugar vs. volatile.acidity and alcohol vs. pH level.
Here we have the boxplots for density vs. residual.sugar and density vs. citric.acid.
These are the scatterplots for density vs. residual.sugar and vs. citric.acid.
Here we see the boxplots for the relationship of residual.sugar vs. free.sulfur.dioxide and for quality vs. citric.acid.
These are the corresponding scatterplot and boxplot for the relationships in the boxplots above.
We see a higher level of quality when volatile.acidity is lower. There are some outliers with a higher level of residual.sugar (>6) and also a higher level of citric.acid, which could be an interesting question for the level of quality.
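The relationship between quality and volatile.acidity can also be checked numerically by comparing the median volatile.acidity per quality level, which is the value a boxplot draws as its centre line. The sketch below uses the first ten rows from the `str()` output, so it only illustrates the technique and cannot confirm the trend on its own.

```r
# First ten rows from the str() output above.
quality          <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Median volatile.acidity for each quality level present in the sample.
median_va <- tapply(volatile.acidity, quality, median)
median_va

# Number of wines behind each median.
table(quality)
```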
My main interest was in the relationships between (a) density vs. citric.acid or residual.sugar, (b) pH level vs. volatile.acidity and (c) free.sulfur.dioxide vs. residual.sugar. For (a), density vs. residual.sugar shows a strong relationship around the mean residual.sugar of 2.5. This is interesting because density is determined by the concentration of alcohol and sugar. This plot can be improved in the multivariate plots section. Density vs. citric.acid also shows a fairly strong relationship. The relationship for (b), pH level vs. volatile.acidity, is interesting because it separates tart wines from soft wines. In the plot we can observe that the pH level is mostly between 3.0 and 3.5/3.6 and the level of volatile.acidity is mostly between 0.2 and 0.8. This plot can also be improved in the multivariate section to see more relationships. The next interesting relationship is (c), free.sulfur.dioxide vs. residual.sugar. In the plot we can see that at a low level of sugar (less than 4) the concentration of free.sulfur.dioxide varies widely (between around 5 and 40). The trend shows that higher levels of sugar come with higher values of free.sulfur.dioxide. This chemical compound is important for preserving the flavor after harvest and protects the wine from further fermentation, so the wine can be stored for many years.
Yes, in the end I searched for interesting insights in other relationships between variables. For residual.sugar vs. citric.acid I found some outliers with a higher level of residual.sugar (>6) and also a higher level of citric.acid. This could be an interesting question for the level of quality. I also looked at the relationship of quality vs. volatile.acidity. Here we see a higher level of quality when volatile.acidity is lower. This might lead to the assumption that some softer wines are rated higher than tart wines.
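The strongest positive relationship in the correlation table is quality vs. alcohol (correlation coefficient of 0.48), followed by density vs. citric.acid (0.36) and density vs. residual.sugar (0.36). In absolute terms, the negative correlations of citric.acid with volatile.acidity (-0.55) and with pH (-0.54), and of density with alcohol (-0.50), are even stronger. The density relationships are interesting because density is determined by the concentration of alcohol and sugar, which might be important for the quality and wine_rating.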
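To rank variable pairs by strength regardless of sign, the coefficients can be sorted by absolute value. The sketch below does this for the six variables whose rows are printed in the correlation matrix above, with the coefficients copied from that output. The alcohol and quality rows are not printed there, so pairs such as quality vs. alcohol are not part of this ranking.

```r
vars <- c("volatile.acidity", "citric.acid", "residual.sugar",
          "free.sulfur.dioxide", "density", "pH")

# Coefficients copied from the printed correlation matrix.
cor_m <- matrix(
  c( 1.00, -0.55,  0.00, -0.01,  0.02,  0.23,
    -0.55,  1.00,  0.14, -0.06,  0.36, -0.54,
     0.00,  0.14,  1.00,  0.19,  0.36, -0.09,
    -0.01, -0.06,  0.19,  1.00, -0.02,  0.07,
     0.02,  0.36,  0.36, -0.02,  1.00, -0.34,
     0.23, -0.54, -0.09,  0.07, -0.34,  1.00),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once (upper triangle), then sort by absolute correlation.
upper <- which(upper.tri(cor_m), arr.ind = TRUE)
pairs_by_strength <- data.frame(
  var1 = rownames(cor_m)[upper[, 1]],
  var2 = colnames(cor_m)[upper[, 2]],
  r    = cor_m[upper.tri(cor_m)],
  stringsAsFactors = FALSE
)
pairs_by_strength <- pairs_by_strength[order(-abs(pairs_by_strength$r)), ]

head(pairs_by_strength, 3)  # the three strongest pairs in this subset
```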
This plot shows the relationship between alcohol, residual.sugar and quality, which is expected to be the strongest relationship. We see that the quality level tends to be higher with a higher level of alcohol.
This plot shows the relationship between alcohol, density and quality. We can observe that the density is lowest for a higher level of alcohol combined with a low level of residual.sugar.
This plot shows the quality more clearly because it uses the created variable ‘wine_rating’. As you can see, there are some outliers with a medium rating, a lower level of alcohol and a high level of sugar. Most of the highly rated wines, however, have a residual.sugar level lower than 8, mostly lower than 6, and an alcohol level between 10 and 14.
Here we see the highest wine_ratings for pH levels mostly between 3.0 and 3.5 and a volatile.acidity level of less than 0.6. For higher levels of volatile.acidity we see a lower wine_rating.
This plot shows the quality of red wines based on the relationship between residual.sugar and free.sulfur.dioxide.
This plot shows two different results. First, with a lower level of residual.sugar and free.sulfur.dioxide the wine_rating is more likely to be high or medium. But there are also values showing that with a slightly higher level of residual.sugar but a lower level of free.sulfur.dioxide the wine_rating is more likely to be high. We also see some outliers with a lot of residual.sugar and/or high levels of free.sulfur.dioxide that have a wine_rating of medium.
I continued to investigate the relationships from the bivariate analysis. In the multivariate analysis I added a further variable, mostly quality or wine_rating, to find out which wines had been rated highly and for what reason. First I looked at the relationship between alcohol, residual.sugar, density and quality, and later wine_rating. Most interesting was that the quality level tends to be higher with a higher level of alcohol and a medium level of residual.sugar. Next I took a look at the relationship between volatile.acidity and pH level. This one was really interesting because it shows whether soft or tart wines are rated more highly. In this case, wines with a lower volatile.acidity and a pH level of 3.0 to 3.5 are preferred. Last, I wanted to know more about the rating in relation to free.sulfur.dioxide and residual.sugar. Sulfur dioxide is added to prevent further fermentation, so that wines can be stored for years, and to preserve the flavor of the grapes after the harvest. This plot was not so clear, since there are two possible directions. One result was that with a lower level of residual.sugar and free.sulfur.dioxide the wine_rating is more likely to be high or medium. The second result is that with a slightly higher level of residual.sugar but a lower level of free.sulfur.dioxide the wine_rating is also more likely to be high.
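A compact numeric counterpart to these multivariate plots is a table of group means per rating. The sketch below uses `aggregate()` on the first ten rows from the `str()` output. The data frame name `wine` and the cut points for `wine_rating` are assumptions (only the mapping of quality 5 and 6 to "medium" is shown in the report), and ten rows are far too few to confirm any trend.

```r
# First ten rows from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed bins: (0,4] = low, (4,6] = medium, (6,10] = high.
wine$wine_rating <- cut(wine$quality, breaks = c(0, 4, 6, 10),
                        labels = c("low", "medium", "high"))

# Mean alcohol and volatile.acidity per rating; empty ratings are dropped.
by_rating <- aggregate(cbind(alcohol, volatile.acidity) ~ wine_rating,
                       data = wine, FUN = mean)
by_rating
```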
One thing I did not expect is that residual.sugar doesn’t seem to have much impact on the quality or wine_rating. I would also have expected some plots to show a clearer trend in one direction.
The RedWineQuality dataset was very interesting to explore. My final plots summarize the relationships for some of the variables with the strongest correlations as well as variables which seem to have an impact on the quality rating of the wine. The trend is that red wines with a lower volatile.acidity are preferred. These wines tend to be softer and more balanced, which seems reasonable to me. Red wines usually have a pH level of 3.3-3.6, white wines usually 3.0-3.4. Most of the red wines here have a pH level between 3.0 and 3.5. In combination with a low volatile.acidity these wines tend to be softer. Wines with a higher level of alcohol were also rated much higher. Overall, the quality rating was medium with a tendency towards high.
I chose the combination of these two plots because they show the levels of alcohol and residual.sugar. The alcohol level is most often lower than 10 or 12, and the level of residual.sugar is also low, mostly lower than 4 on a scale of 10.
This plot shows the quality levels by volatile.acidity: quality tends to be higher for lower levels of volatile.acidity. This is interesting because it shows that more balanced wines are preferred.
Finally, I chose this plot to show the relationship between volatile.acidity and the pH level, because it tells us whether a wine is softer or more tart. We see that the highest ratings went to wines with a pH level between 3.0 and 3.5 and a level of volatile.acidity of less than 0.8, mostly less than 0.6.
The RedWineQuality dataset was very interesting to explore. It might have been helpful to have more variables, since some of the given variables didn’t seem to have much impact. But this was a first exploration, and a second and third one could follow from it. One of my struggles was finding a trend in the data, because some plots seemed to show more than one result, so it was not always easy to interpret the data for the next step. Other plots were straightforward. As expected, most wines fell into the medium wine_rating. This may also reflect some kind of social desirability, as it was probably not always easy to determine whether one wine was better than another. Interesting future work could include a comparison between red and white wines. Are there different ratings in total, and which wine is preferred, white or red? Further interesting questions concern the region of the wine and the grapes that were used. It would also be interesting to know the age and sex of the participants who rated the wine, and whether there is a preference for white or red wine by sex, by age or by both, as well as whether female and male participants drink wine often or just occasionally. It must also be considered that not everyone has a lot of experience in rating a wine and its taste or flavor.